VOICE CHANNEL OBJECTIVE EVALUATION
USING LINEAR PREDICTIVE CODING


W.J. Hartman* and S.F. Boll**


We present results of a feasibility study
on the use of linear predictive coding
(LPC) techniques for deriving an objective
measure of intelligibility over voice communication
channels. Background material is
given and several potentially useful measures
are identified. The limitations of the
present study are detailed and methods of
overcoming these limitations in future work
are outlined. In spite of these limitations,
the study strongly supports the suitability
of LPC techniques for the objective measurement
of intelligibility.

Key words: Intelligibility testing,
linear predictive coding.


I. INTRODUCTION

A common procedure for determining the intelligibility of
voice channels is to use a predetermined vocabulary with selected
speakers and a listener panel to subjectively grade the
intelligibility after the spoken words have passed through some
voice channel. A variety of such testing schemes has been
devised. Most of these schemes have the desirable property
of producing repeatable results which can be interpreted
in terms of user requirements. However, the requirement for
listener panels greatly restricts the utility of these testing
methods, and a long-sought goal has been to replace these


*The  author  is  with  the  Institute  for  Telecommunication 
Sciences,  Office  of  Telecommunications,  U.S.  Department  of 
Commerce,  Boulder,  Colorado  80302. 

**The  author  is  with  the  University  of  Utah  and  Software 
Sciences  Corporation,  Salt  Lake  City,  Utah. 


listener panels with hardware. The work reported here covers
one step in the direction of reaching that goal. This study
uses a 50-word phonetically balanced word list played through
five different voice systems with a range of articulation
scores from 64.7% to 95% as a data base. A mathematical
technique called Linear Predictive Coding (LPC), which was
originally applied to the analysis and synthesis of voice,
is used to derive a distance measure for each word between
the original undistorted words and the words after passing
through a voice channel. This measure is compared with the
subjective scoring for each word. For those words with a
range of subjective scores (% correct) the distance measure
is a decreasing function of the subjective scores. These
results strongly support the suitability of using measures
derived from the LPC methodology for an objective measurement
of intelligibility.

II. DESCRIPTION OF THE VOICE TAPES USED

The  word  group  chosen  for  the  analysis  was  the  phonetically 
balanced  50-word  list  (word  group  284)  shown  in  table  1. 

This word group had previously been used by the Army Electronic
Proving  Ground  Electromagnetic  Environment  Test  Facility, 
at  Ft.  Huachuca,  Arizona,  for  systems  evaluations  work,  and 
several  tapes  with  a range  of  articulation  scores  were  available. 
The  articulation  score  is  defined  here  as  the  percent  of 
correct  responses  by  the  listener  panel.  The  master  tape 
was  made  using  three  male  and  two  female  speakers. 

The  decision  to  use  the  prerecorded  prescored  tapes  was  made 
so  that  a range  of  articulation  scores  would  be  available. 
Table 1

PB Word Group 284

 1  JELL        21  BIND        41  FOOD
 2  SPICE       22  LICK        42  PINT
 3  COD         23  CALF        43  ROT
 4  CHEW        24  CATCH       44  RHYME
 5  THREAD      25  DUMB        45  FLIP
 6  SHACK       26  US          46  WHEEZE
 7  BOLT        27  FORTH       47  GUESS
 8  LOOK        28  YEAST       48  ASK
 9  LEFT        29  FROCK       49  FAD
10  DEUCE       30  EACH        50  ROPE
11  BID         31  NIGHT
12  KILL        32  WIG
13  CRACK       33  QUEEN
14  DAY         34  FRONT
15  TILL        35  ROD
16  SLIDE       36  EASE
17  CLOD        37  FREAK
18  THIS        38  HUM
19  BORED       39  REST
20  CHANT       40  ROLL
The scores chosen were 64.7%, 73%, 78.5%, 87.3%, and 95%.

For each tape, information was available for each word on
the number of correct responses, the actual responses made,
and other pertinent information on the scoring. However,
information was not obtained on the specific system which
produced the distortion (noise) on the tapes, except to assure
that at least some of the distorted tapes came from digital
systems. Table 2 lists the word by word scores for each
of the tapes, giving the number and percentage of correct
responses.
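As a worked check on how the per-word counts relate to the overall articulation score, the sketch below pools the Tape 2 correct-response counts from Table 2 over all 50 words and 6 listeners. It assumes (the text does not say so explicitly) that the articulation score is total correct responses divided by total responses; the function name `articulation_score` is illustrative only.

```python
# Correct-response counts for Tape 2 (6 listeners), words 1-50, read from Table 2.
tape2_correct = [
    6, 0, 6, 5, 3, 3, 2, 6, 1, 6,   # words 1-10
    2, 6, 6, 4, 5, 3, 6, 3, 4, 1,   # words 11-20
    2, 1, 5, 1, 1, 0, 5, 4, 5, 5,   # words 21-30
    6, 0, 6, 3, 6, 2, 5, 1, 4, 6,   # words 31-40
    5, 6, 4, 6, 1, 6, 5, 6, 6, 3,   # words 41-50
]
LISTENERS = 6


def articulation_score(correct_counts, listeners):
    """Percent of correct responses pooled over all words and listeners (assumed definition)."""
    total_correct = sum(correct_counts)
    total_responses = len(correct_counts) * listeners
    return 100.0 * total_correct / total_responses


score = articulation_score(tape2_correct, LISTENERS)
print(f"{score:.1f}")  # 194 correct out of 300 responses
```

Under this pooled definition the Tape 2 counts reproduce the 64.7% figure quoted for that tape.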

Although  the  motivation  for  choosing  the  tapes  was  sound, 
the  choice  caused  problems  in  aligning  words.  These  problems 
are  discussed  in  Section  III  and  one  solution  to  this  problem 
is  discussed  in  Section  VI. 

III.  DATA  PROCESSING 

The analog signals were first processed through an AGC and
low pass filtered to 3.2 kHz. This signal was then sampled
at a rate of 8 kHz and quantized to 8 bits. The quantized
samples were blocked into records 160 samples long and
the mean and standard deviation (SD) were calculated for
each record. The SD was used as an energy criterion to
determine the endpoints of the words, and the word "midpoints".
The "midpoint" for a word was defined to be the beginning of the
160-point record which was centered in those records with the
largest standard deviations for that word. It was defined in this
way so that the correlations described in the next paragraph
were meaningful. This midpoint was usually close to the
point midway between the endpoints.
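The record-based energy criterion can be sketched as follows. This is only one possible reading of the procedure: the threshold value, the `top_fraction` parameter, and the rule of taking the middle record among the loudest ones are assumptions introduced for illustration, since the report does not give these details.

```python
import numpy as np

RECORD_LEN = 160  # samples per record, as stated in the text


def record_sd(samples, record_len=RECORD_LEN):
    """Standard deviation of each complete record of the signal."""
    n_rec = len(samples) // record_len
    blocks = np.asarray(samples[: n_rec * record_len], dtype=float)
    return blocks.reshape(n_rec, record_len).std(axis=1)


def word_endpoints_and_midpoint(samples, threshold, top_fraction=0.5):
    """Return (start, end, midpoint) sample indices, or None if no record is active.

    Assumed rule: records with SD above `threshold` belong to the word; the
    midpoint is the start of the middle record among the loudest ones.
    """
    sd = record_sd(samples)
    active = np.flatnonzero(sd > threshold)
    if active.size == 0:
        return None
    first, last = active[0], active[-1]
    k = max(1, int(round(top_fraction * active.size)))
    loudest = np.sort(active[np.argsort(sd[active])[-k:]])
    mid_record = loudest[len(loudest) // 2]
    return first * RECORD_LEN, (last + 1) * RECORD_LEN, mid_record * RECORD_LEN


# Synthetic "word": a tone burst with a smooth envelope between samples 800 and 2400.
n = np.arange(3200)
envelope = np.where((n >= 800) & (n < 2400), np.sin(np.pi * (n - 800) / 1600.0), 0.0)
signal = 100.0 * envelope * np.sin(2 * np.pi * n / 16.0)
start, end, mid = word_endpoints_and_midpoint(signal, threshold=5.0)
print(start, end, mid)
```

For the synthetic burst the detected endpoints coincide with the burst boundaries and the midpoint falls in one of the two central records.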

For  the  tapes  with  distortion,  the  energy  criteria  were  used 
to  first  define  the  midpoint  of  the  1st  and  50th  words. 


Table 2. Number of correct responses and percent

Word   Tape 2        Tape 3        Tape 4        Tape 5        Tape 6
       64.7 Percent  73.0 Percent  78.5 Percent  87.3 Percent  95 Percent
       6 listeners   8 listeners   8 listeners   6 listeners   8 listeners

  1    6 (100.0)     7 ( 87.5)     6 ( 75.0)     6 (100.0)     8 (100.0)
  2    0 (  0.0)     3 ( 37.5)     3 ( 37.5)     4 ( 66.6)     8 (100.0)
  3    6 (100.0)     8 (100.0)     8 (100.0)     5 ( 83.3)     8 (100.0)
  4    5 ( 83.3)     5 ( 62.5)     8 (100.0)     6 (100.0)     8 (100.0)
  5    3 ( 50.0)     6 ( 75.0)     7 ( 87.5)     6 (100.0)     8 (100.0)
  6    3 ( 50.0)     8 (100.0)     4 ( 50.0)     5 ( 83.3)     7 ( 87.5)
  7    2 ( 33.3)     7 ( 87.5)     4 ( 50.0)     6 (100.0)     8 (100.0)
  8    6 (100.0)     6 ( 75.0)     8 (100.0)     6 (100.0)     8 (100.0)
  9    1 ( 16.6)     1 ( 12.5)     1 ( 12.5)     5 ( 83.3)     8 (100.0)
 10    6 (100.0)     8 (100.0)     7 ( 87.5)     6 (100.0)     8 (100.0)
 11    2 ( 33.3)     1 ( 12.5)     6 ( 75.0)     5 ( 83.3)     7 ( 87.5)
 12    6 (100.0)     7 ( 87.5)     8 (100.0)     5 ( 83.3)     8 (100.0)
 13    6 (100.0)     8 (100.0)     7 ( 87.5)     6 (100.0)     8 (100.0)
 14    4 ( 66.6)     7 ( 87.5)     7 ( 87.5)     4 ( 66.6)     8 (100.0)
 15    5 ( 83.3)     5 ( 62.5)     8 (100.0)     4 ( 66.6)     8 (100.0)
 16    3 ( 50.0)     4 ( 50.0)     8 (100.0)     6 (100.0)     7 ( 87.5)
 17    6 (100.0)     7 ( 87.5)     7 ( 87.5)     4 ( 66.6)     7 ( 87.5)
 18    3 ( 50.0)     4 ( 50.0)     5 ( 62.5)     3 ( 50.0)     8 (100.0)
 19    4 ( 66.6)     7 ( 87.5)     6 ( 75.0)     5 ( 83.3)     8 (100.0)
 20    1 ( 16.6)     5 ( 62.5)     4 ( 50.0)     5 ( 83.3)     7 ( 87.5)
 21    2 ( 33.3)     4 ( 50.0)     1 ( 12.5)     2 ( 33.3)     8 (100.0)
 22    1 ( 16.6)     3 ( 37.5)     3 ( 37.5)     3 ( 50.0)     4 ( 50.0)
 23    5 ( 83.3)     5 ( 62.5)     7 ( 87.5)     5 ( 83.3)     8 (100.0)
 24    1 ( 16.6)     4 ( 50.0)     5 ( 62.5)     5 ( 83.3)     8 (100.0)
 25    1 ( 16.6)     5 ( 62.5)     4 ( 50.0)     6 (100.0)     8 (100.0)
 26    0 (  0.0)     0 (  0.0)     3 ( 37.5)     4 ( 66.6)     4 ( 50.0)
 27    5 ( 83.3)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 28    4 ( 66.6)     5 ( 62.5)     5 ( 62.5)     6 (100.0)     8 (100.0)
 29    5 ( 83.3)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 30    5 ( 83.3)     8 (100.0)     7 ( 87.5)     6 (100.0)     8 (100.0)
 31    6 (100.0)     8 (100.0)     8 (100.0)     5 ( 83.3)     8 (100.0)
 32    0 (  0.0)     7 ( 87.5)     7 ( 87.5)     5 ( 83.3)     8 (100.0)
 33    6 (100.0)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 34    3 ( 50.0)     6 ( 75.0)     6 ( 75.0)     6 (100.0)     8 (100.0)
 35    6 (100.0)     7 ( 87.5)     8 (100.0)     6 (100.0)     8 (100.0)
 36    2 ( 33.3)     6 ( 75.0)     4 ( 50.0)     3 ( 50.0)     8 (100.0)
 37    5 ( 83.3)     4 ( 50.0)     5 ( 62.5)     5 ( 83.3)     6 ( 75.0)
 38    1 ( 16.6)     2 ( 25.0)     7 ( 87.5)     4 ( 66.6)     6 ( 75.0)
 39    4 ( 66.6)     6 ( 75.0)     7 ( 87.5)     6 (100.0)     7 ( 87.5)
 40    6 (100.0)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 41    5 ( 83.3)     6 ( 75.0)     7 ( 87.5)     6 (100.0)     8 (100.0)
 42    6 (100.0)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 43    4 ( 66.6)     3 ( 37.5)     3 ( 37.5)     6 (100.0)     7 ( 87.5)
 44    6 (100.0)     8 (100.0)     8 (100.0)     5 ( 83.3)     8 (100.0)
 45    1 ( 16.6)     5 ( 62.5)     8 (100.0)     6 (100.0)     8 (100.0)
 46    6 (100.0)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 47    5 ( 83.3)     8 (100.0)     7 ( 87.5)     6 (100.0)     8 (100.0)
 48    6 (100.0)     8 (100.0)     8 (100.0)     6 (100.0)     8 (100.0)
 49    6 (100.0)     6 ( 75.0)     8 (100.0)     6 (100.0)     8 (100.0)
 50    3 ( 50.0)     6 ( 75.0)     8 (100.0)     6 (100.0)     7 ( 87.5)
Various length intervals about the midpoint of word 1 were
then used to obtain the cross correlation against similar
intervals from the master word 1, adjusting the midpoint
for the distorted word to agree with the point of maximum
correlation. Using the midpoints from words 1 and 50, a
linear function was determined to relate the distances between
midpoints from the master words to those for the distorted
words. The correlation process was then repeated for each
word. Although no problems were encountered in aligning
many of the words, some of the words presented special
difficulties which were never fully resolved. The procedures given
in Section VI describe a method for obtaining synchronization.
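The alignment step described above can be sketched as a search for the shift that maximizes a normalized cross correlation, plus a linear map fitted through the midpoints of words 1 and 50. The interval half-length, the search range, and both function names are assumptions made for illustration; the report states only that "various length intervals" were tried.

```python
import numpy as np


def align_midpoint(master, distorted, master_mid, guess_mid,
                   half_len=128, max_shift=200):
    """Adjust guess_mid so the distorted interval best correlates with the master interval."""
    ref = np.asarray(master[master_mid - half_len: master_mid + half_len], dtype=float)
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        start = guess_mid + shift - half_len
        stop = start + 2 * half_len
        if start < 0 or stop > len(distorted):
            continue  # interval would run off the recording
        seg = np.asarray(distorted[start:stop], dtype=float)
        denom = np.linalg.norm(ref) * np.linalg.norm(seg)
        corr = np.dot(ref, seg) / denom if denom > 0 else 0.0
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return guess_mid + best_shift


def linear_midpoint_map(master_first, master_last, dist_first, dist_last):
    """Linear function predicting a distorted-tape midpoint from a master-tape midpoint."""
    slope = (dist_last - dist_first) / (master_last - master_first)
    return lambda m: dist_first + slope * (m - master_first)


# Synthetic check: the "distorted" signal is the master delayed by 37 samples.
rng = np.random.default_rng(0)
master = rng.standard_normal(4000)
distorted = np.concatenate([np.zeros(37), master])[:4000]
aligned = align_midpoint(master, distorted, master_mid=2000, guess_mid=2000)
predict = linear_midpoint_map(1000, 50000, 1010, 50500)
print(aligned, predict(25500))
```

The linear map supplies the initial guess for each intermediate word, and the correlation search then refines it.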

Figures 1 through 5 show the difference between the midpoint of the
distorted word and the midpoint of the master word for each of five
systems. The linear trend is to be expected due to normal speed
differences of the 1/4-inch voice recorders used. The variation
about this line was not anticipated. A possible explanation
is that these tapes had been rerecorded, and the cumulative
tape stretching, wow and flutter all contributed to the
differences.

Following the alignment procedure, the data were blocked
into frames consisting of 256 points. These frames were
windowed with a Hamming window of the form

$$W_H(t) = \begin{cases} 0.54 + 0.46 \cos(\pi t / T_W) & \text{for } |t| < T_W \\ 0 & \text{elsewhere} \end{cases}$$

where $T_W$ is the width of the window. This was used to ensure
the stability of the LPC analysis. These windows were then
[Figure 1. Difference between the sample number from the master tape
and the sample number for the 64.7% (#2) tape for the midpoints of
words 1 through 50. Axes: Word Number vs. Midpoint Difference (Samples).]

[Figure 2. Difference between the sample number from the master tape
and the sample number for the 73% (#3) tape for the midpoints of
words 1 through 50. Axes: Word Number vs. Midpoint Difference (Samples).]

processed using the autocorrelation method (Appendix A)*
to obtain the following LPC parameters:

    Correlation coefficients    $R_0 - R_{10}$
    Predictor coefficients      $a_1 - a_{10}$
    Reflection coefficients     $K_1 - K_{10}$
    Error energy
    Pitch

(see A, 3-20); (see A, 1-66); (Atal and Hanauer 1971); (Boll 1973).


The  analyses  were  done  using  16-bit  double  precision  floating 
point  arithmetic. 
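A minimal sketch of the autocorrelation-method analysis of one 256-point frame is given below, using the Levinson-Durbin recursion to obtain the correlation, predictor, and reflection coefficients and the error energy. This is not the authors' program: the function name is hypothetical, the sign convention for the reflection coefficients is one of several in use, and pitch extraction is not shown.

```python
import numpy as np


def lpc_autocorrelation(frame, order=10):
    """Autocorrelation-method LPC of one frame via the Levinson-Durbin recursion.

    Returns (R, a, k, E): correlations R[0..order], predictor coefficients
    a_1..a_p, reflection coefficients k_1..k_p, and the final error energy E.
    """
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    n = len(x)
    R = np.array([np.dot(x[: n - i], x[i:]) for i in range(order + 1)])

    a = np.zeros(order + 1)  # a[j] holds a_j; a[0] is unused
    k = np.zeros(order + 1)
    E = R[0]
    for i in range(1, order + 1):
        # Prediction of R_i from the current order-(i-1) predictor.
        acc = R[i] - np.dot(a[1:i], R[i - 1:0:-1])
        k[i] = acc / E
        prev = a.copy()
        a[i] = k[i]
        a[1:i] = prev[1:i] - k[i] * prev[i - 1:0:-1]
        E *= 1.0 - k[i] ** 2
    return R, a[1:], k[1:], E


# Example on a synthetic 256-point frame (white noise stands in for speech).
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
R, a, k, E = lpc_autocorrelation(frame)
print(np.round(a, 3), round(float(E), 3))
```

Because the frame is windowed before the correlations are formed, all reflection coefficients have magnitude below one, which is the stability property the window is said to guarantee.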


The  decision  on  the  sampling  rate,  number  of  coefficients, 
and  frame  length  was  based  on  the  availability  of  hardware 
using  these  numbers  and  a software  simulation  of  this  hardware. 
This  choice  probably  reduces  the  resolution  of  the  derived 
distortion  measures,  although  in  this  study  the  data  sample 
analyzed  is  probably  too  small  to  discern  fine  grain  resolution. 
In  any  case,  it  appears  that  increasing  the  sampling  rate 
to  10  kHz  and  the  number  of  coefficients  from  10  to  12  will 
not  significantly  increase  the  complexity  of  the  equipment 
and  should  give  better  resolution. 


IV.  DISTORTION  MEASURES 

The method for determining distortion which was used in this
study was to compare the predictor coefficients for corresponding
frames of the master tape and the distorted tapes. This
comparison can be done in numerous ways, each with potential
advantages. This technique is examined for speaker verification
by Atal [1974] and Rosenberg and Sambur [1975] where the

* (A, 3-1)  will  be  used  to  denote  equation  3-1,  Appendix  A. 
comparison is directly between the coefficients. A different
approach has been developed by Makhoul [1973] in the area
of variable frame rate transmission and by Itakura [1975]
in the area of isolated word recognition. These methods examine
how "close" one set of predictor coefficients is to another
set by comparing the linear prediction residual resulting
from each set. This latter technique proved to be best for
the purposes of this study.

The first attempt to identify a distortion measure was based
on a metric of the form

$$\sum_{i=1}^{p} w_i (a_i - a_i')^2$$

where the $w_i$ represent weights and the $a_i$ were either the
predictor coefficients or the reflection coefficients and the
unprimed and primed quantities refer to the master and distorted
signal parameters respectively, as they will throughout the remainder
of this report. No distortion measure of this form could
be found to relate to the subjective scores. Some discussion
of the failure of this form of measure in a different context
is given by Itakura [1975].

The linear prediction residual is defined for a sampled signal
$\{s_n\}$ (n = 0, 1, 2, ..., N) and any set of predictor coefficients
$\{\alpha_k\}$ (k = 1, 2, ..., p) as

$$D = \sum_{n=0}^{N} (s_n - \hat{s}_n)^2$$

where

$$\hat{s}_n = \sum_{k=1}^{p} \alpha_k s_{n-k}.$$

For this analysis N=256 and p=10, and the limits of summation
will be omitted since no misunderstanding is involved.

If the predictor coefficients are the least squares solution
for the signal $\{s_n\}$ then (A, 3-3)

$$D = E$$

where E is the minimum residual. It follows that $D \ge E$
for any set of predictor coefficients.


The distortion measure $D_1$ can be interpreted as a measure
of how closely the coefficients derived from the distorted data
predict the original data. Similarly, one could define a
distortion measure, $D_2$, which measures how closely the coefficients
derived from the master data $\{a_k\}$ predict the distorted data, $\{s_n'\}$.
Several other distortion measures are also possible, and
can be easily summarized by the introduction of the following
notation.

Let $a^T = (1, -a_1, \ldots, -a_{10})$ be the transpose
of $a$, and $R = [R_{|i-j|}]$ be the matrix of unnormalized correlation
values (A, 3-16).

Then we have

$$E = a^T R a$$
$$E' = a'^T R' a'$$
$$D_1 = a'^T R a'$$
$$D_2 = a^T R' a.$$

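It is not difficult to relate each of these to the appropriate
power spectra estimates (A, 4-7). Let

$$P(\omega) = \left| \sum_n W(nT) e^{-jn\omega T} \right|^2$$

be the power spectrum of the windowed data where $W(nT) = W_H(nT) S(nT)$
and let

$$\hat{P}(\omega) = \frac{A^2}{\left| 1 - \sum_k a_k e^{-jk\omega T} \right|^2}$$

be the linear prediction spectrum where A is an appropriate
constant (A, 4-6a). Then, using the same convention for primed and
unprimed quantities as before, it can be shown that (A, 4-16)

$$E = \frac{A^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}(\omega)} \, d\omega ,$$

and similarly,

$$E' = \frac{A'^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P'(\omega)}{\hat{P}'(\omega)} \, d\omega .$$

It can also be shown that

$$D_1 = \frac{A'^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}'(\omega)} \, d\omega ,$$

and similarly,

$$D_2 = \frac{A^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P'(\omega)}{\hat{P}(\omega)} \, d\omega .$$

We also have

$$D_1 = \frac{A'^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{\hat{P}(\omega)}{\hat{P}'(\omega)} \, d\omega , \qquad D_2 = \frac{A^2 T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{\hat{P}'(\omega)}{\hat{P}(\omega)} \, d\omega .$$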

Of all the measures which were extensively tested, the measure
ASI defined by

$$\mathrm{ASI} = 10 \log [D_1 / E]$$

appears to give the best correlation with the subjective scores.

After the completion of most of the work, the measure
$5 \log (D_1/E) + 5 \log (D_2/E')$ was tested for a limited sample and
appears to be superior to ASI. This measure and related
measures are discussed in Gray and Markel [1976] where the
asymmetry in ASI is shown.
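The two measures can be sketched for a single pair of frames as follows. The sketch takes the ratio in ASI to be the residual obtained when the distorted-tape coefficients predict the master data, divided by the minimum master residual E, and it takes the logarithm to be base 10; both are readings of the text rather than confirmed details. The helper names are hypothetical, and the normal equations are solved directly rather than by any particular recursion.

```python
import numpy as np


def autocorr(frame, order):
    """Unnormalized correlations R_0..R_order of a Hamming-windowed frame."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    return np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(order + 1)])


def toeplitz(R):
    """Symmetric Toeplitz matrix with entries R_{|i-j|}."""
    n = len(R)
    return R[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]


def error_vector(R):
    """Vector (1, -a_1, ..., -a_p) for the least-squares predictor of R."""
    p = len(R) - 1
    a = np.linalg.solve(toeplitz(R[:p]), R[1:])
    return np.concatenate(([1.0], -a))


def distortion_measures(master_frame, distorted_frame, order=10):
    """Return (ASI, symmetric measure) in dB for one pair of frames."""
    R, Rp = autocorr(master_frame, order), autocorr(distorted_frame, order)
    a, ap = error_vector(R), error_vector(Rp)
    E = a @ toeplitz(R) @ a        # minimum residual, master
    Ep = ap @ toeplitz(Rp) @ ap    # minimum residual, distorted
    D1 = ap @ toeplitz(R) @ ap     # distorted coefficients on master data
    D2 = a @ toeplitz(Rp) @ a      # master coefficients on distorted data
    asi = 10.0 * np.log10(D1 / E)
    symmetric = 5.0 * np.log10(D1 / E) + 5.0 * np.log10(D2 / Ep)
    return asi, symmetric


rng = np.random.default_rng(1)
clean = np.convolve(rng.standard_normal(256), [1.0, 0.9, 0.5], mode="same")
noisy = clean + 0.5 * rng.standard_normal(256)
asi_same, sym_same = distortion_measures(clean, clean)
asi_noisy, sym_noisy = distortion_measures(clean, noisy)
print(asi_same, asi_noisy, sym_noisy)
```

Since E is the minimum of the quadratic form for the master correlations, both measures are zero for identical frames and nonnegative otherwise.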


V.  RESULTS 

For each word, the distortion measures (ASI) between the
master and each distorted tape were computed for 24 analysis
frames, twelve preceding and twelve following the word
midpoint. An average of ASI over the twenty-four frames
was formed, and this was compared to the subjective scoring
for that word. Figures 6(a), (b), and (c) show plots
of ASI vs percent correct responses for three words. For
those words which had a range of correct responses these
figures are typical. Of course, a range of distortion
measures is possible for completely understood words, and this
range is word dependent (i.e., different words can have
different amounts of distortion before intelligibility is
affected).

Ideally, the points in figure 6 would lie on a straight line,
and a detailed study was made on these three words to explain
the deviations. The first factor considered was the midpoint
and frame alignment. ASI was computed for the analysis frames
of the distorted words shifted from the original alignment by
+128 points and +256 points and the average compared to the
[Figure 6. Comparison of the 24 frame average of the
measure ASI with the percent correct
responses: (a) word 2, (b) word 5,
and (c) word 6. The numbers by the
points correspond to the tapes: 2 (64.7%),
3 (73%), 4 (78.5%), 5 (87.3%), 6 (95%).
Axes: Objective Measure vs. Percent Correct/100.]
subjective scores (step 1). Improvement in correlation was
noted for words 5 and 6 with different alignments. Using a
simple energy criterion (see Section III) to determine the
beginning and ending of the word on the master tape, the
ASI averaged between those points was computed, again using
the shifting of step 1, and compared to the subjective scoring.
Considerable improvement in the correlation for word 2 was
noted with no improvement for words 5 and 6. Finally, the
measure described in the last paragraph of Section IV was
computed, again for the step 1 shifts, and both with and
without the inclusion of the quiet periods (described above)
in the average. The best correlation was obtained using this
measure without quiet periods. Calculations of this type
were not done for additional words because of the computing
time involved.

The average over all fifty words of the ASI for each distorted
tape including any quiet periods is shown plotted vs articulation
score in figure 7. It appears that using the steps discussed
above would appreciably improve this correlation.

VI.  SYNCHRONIZATION 

Because of the frame alignment problem discussed above, several
techniques for reducing the problem were investigated. First,
an instrumentation tape recorder with a phase lock capstan
control was used to record a binary pseudo noise (PN) signal
of length 127 at a 1 kHz bit rate. This was re-recorded on a
second recorder (with phase lock control). The two PN signals
were then digitized at a 10 kHz sampling rate and cross
correlated. Aside from a linear shift due to slightly
differing local oscillator frequencies, the two sequences
differed by less than one sample in 10 samples. This indicates
that the digitized words from a master and distorted tape can
be aligned to within one sample.

The PN sequence described above could not be transmitted over
a voice channel. Thus, an FSK modem was used to transmit the
PN signal. Preliminary results indicate that a length 127 PN
signal at 635 bits/second driving an FSK modem and band limited
to 2500 Hz can be used to align words within 3 (10 kHz) samples,
even after the signal has been severely distorted.
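The reason a length 127 PN signal aligns so sharply can be seen from its correlation properties. The sketch below generates a 127-chip maximal-length sequence with a 7-stage shift register and recovers a known circular shift by cross correlation. The generator recurrence (x^7 + x + 1) and the seed are assumptions, since the report does not identify the sequence used, and the tape recorder, FSK modem, and sampling stages are not modeled.

```python
import numpy as np


def pn_sequence_127(seed=0b1111111):
    """One period (127 bits) of a maximal-length sequence from a 7-stage shift register.

    Output bits satisfy x[n+7] = x[n] XOR x[n+1] (an assumed generator).
    """
    state = [(seed >> i) & 1 for i in range(7)]
    if not any(state):
        raise ValueError("seed must be nonzero")
    out = []
    for _ in range(127):
        out.append(state[-1])
        feedback = state[-1] ^ state[-2]
        state = [feedback] + state[:-1]
    return np.array(out)


def circular_lag(reference, received):
    """Lag (in chips) at which the circular cross correlation peaks."""
    corr = [np.dot(reference, np.roll(received, -lag)) for lag in range(len(reference))]
    return int(np.argmax(corr))


bits = pn_sequence_127()
chips = 1 - 2 * bits                      # map {0, 1} to {+1, -1}
autocorr = np.array([np.dot(chips, np.roll(chips, -lag)) for lag in range(127)])
received = np.roll(chips, 45)             # same sequence delayed by 45 chips
print(autocorr[0], set(autocorr[1:].tolist()), circular_lag(chips, received))
```

The periodic autocorrelation is 127 at zero lag and -1 at every other lag, so the correlation peak identifies the delay unambiguously within one period.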

VII.  CONCLUSIONS 

Because  of  the  time  and  expense  involved  in  making  subjective 
measurements  over  voice  channels,  an  objective  measurement 
technique  is  desired.  Such  a technique  should  relate  to 
the  subjective  scoring.  This  study  has  examined  the  feasibility 
of  using  an  analysis  of  the  same  data  used  for  subjective  scoring. 
The  results,  although  hampered  by  a small  data  base  with  some 
inherent  inaccuracies,  indicate  that  the  objective  measure  developed 
here  can  be  used  as  a predictor  of  subjective  scores. 

Methods  for  eliminating  the  major  problems  encountered  with 
the  data  have  been  tested  and  found  adequate. 

Several  refinements  of  the  basic  measure  were  tested  and  were 
found  to  give  better  predictions.  Additional  refinements  appear 
to  be  possible  for  a small  increase  in  the  complexity  of 
computations  once  the  accurate  frame  alignment  is  achieved. 
APPENDIX A


This appendix is reproduced from Makhoul, J.I., and J.J. Wolf
(1972), Linear prediction and the spectral analysis of speech,
Bolt Beranek and Newman, Inc., Cambridge, MA with permission
of the authors. Dots indicate where material of the original
report has been omitted. However, the original chapter and
equation numbering is retained.

The marginal notes indicate points particularly pertinent to
the present study.


A-l 


INTRODUCTION 


1.1 Historical Overview

One of the most important methods of speech analysis has
been the use of the short-time spectrum. This has been
accomplished in different ways and to different ends during the past
25 years. The first major breakthrough was the invention of the
sound spectrograph (Koenig, Dunn and Lacey, 1946) which is still
used extensively for the spectral analysis of speech. In 1960,
G. Fant published the classic Acoustic Theory of Speech Production
which laid the foundations for many of the different methods of
speech analysis that followed. As a direct result of the significant
advances that occurred in understanding the acoustics of
speech production, and with the aid of high-speed digital
computers, the method of analysis-by-synthesis was given new impetus
at M.I.T. (Bell, Fujisaki, Heinz, Stevens and House, 1961). A
bank of 36 band-pass filters was used in their analysis. Another
landmark was the pitch-synchronous analysis of voiced sounds as
reported by Mathews, Miller and David (1961) at Bell Labs. They
actually used analysis-by-synthesis on the spectrum of a single
pitch period obtained by a Fourier analysis of the sampled
waveform. In 1964, A.M. Noll introduced the cepstrum for the purpose
of pitch extraction. The cepstrum was later used as the basis for
a formant tracking system (Schafer and Rabiner, 1970). This very
brief review gives a representative sample of the ideas and
methodologies that have had a definite effect on the types of speech
analysis that many speech researchers have chosen to pursue. A
more complete review can be found in Flanagan (1972).

1.2  Linear  Prediction 

The past two years have witnessed a surge of interest on the
part of the speech community in a method of analysis known
alternately as predictive coding, linear prediction, Prony's method,
inverse filtering formulation, etc. This surge of interest has
been also accompanied by an air of confusion. Two main reasons
for this confusion are:

(1) A lack of exposition on the similarities and differences
between different formulations.

(2) A resurfacing of some of the problems (e.g. windowing,
preemphasis, etc.) associated with accepted methods for
computation of short-time spectra.

We shall attempt, in this report, to deal with these problems
by relating a few of these formulations to each other.

Let us first discuss what these formulations have in common.
As far as we can ascertain, all the methods we have inspected have
exactly one thing in common: they all assume that at a particular
instant in time, a speech sample s(nT) can be approximated by a
linearly weighted summation of the past p samples, where p is
some integer.


$$s(nT) \approx \sum_{k=1}^{p} a_k \, s(nT - kT)$$

or

$$s_n \approx \sum_{k=1}^{p} a_k \, s_{n-k} \qquad (1\text{-}1)$$


where T is the sampling interval, n is the sample number, and $a_k$,
$1 \le k \le p$, are the weights. Equivalently, given p samples of a speech
signal, the following sample can be predicted approximately by a
linear summation of the p known samples, hence the term "linear
prediction". Henceforth we shall use the term "linear prediction"
as a generic name for any method that makes an assumption
equivalent to that in (1-1).

The problem at hand, as put forth by linear prediction, is
to compute a set of predictor coefficients $a_k$ such that (1-1)
holds optimally over a specified period of time. It is in
computing the set of coefficients $a_k$ that different formulations of
linear prediction have evolved.

The assumption in (1-1) could be made for any signal, be it
speech or not. The reason that this assumption works well for
speech is that it is based on a model of speech production which
has been shown to work quite well in analysis-synthesis systems
(Fant, 1960). Basically, the model assumes an all-pole transfer
function of the combined effects of the glottal source, the
vocal tract and radiation. These poles can be computed by
solving a polynomial in z with coefficients $a_k$. A more detailed
description of this model is given in Chapter II.

Theoretically there exist an unlimited number of ways in
which to compute the coefficients $a_k$. However, we shall initially
limit our discussion to three formulations which we feel to be
representative of the possible methods of analysis, and which
raise some interesting issues. We shall describe briefly each of
the formulations and give representative references on each
without attempting to give a complete bibliography. The three methods
will be given mnemonic names for ease of reference.

Exact Method

This method assumes that:

(a) The signal is defined for exactly 2p consecutive values.

(b) A speech sample can be predicted exactly from the past
p samples, and that

(c) This holds for the trailing p consecutive samples.

These assumptions are represented by the following set of equations:

$$\sum_{k=1}^{p} a_k s_{n-k} = s_n , \quad n = 0, 1, \ldots, p-1. \qquad (1\text{-}2)$$

These are p equations in p unknowns which in general can be
solved for the coefficients $a_k$, $1 \le k \le p$.

Covariance Method

This method assumes that:

(a) The signal is defined for p+N consecutive values,
where N is some integer.

(b) A speech sample can be approximately predicted from
the past p samples, and that

(c) This holds for the trailing N consecutive samples.

(d) The total-squared error between the real signal and
its predicted value is minimized over the N consecutive
samples. (Some prefer to use the mean-squared
error instead of total-squared error. The difference
in this case is a division by a constant N which does
not affect the results of minimization.)

The minimization of error results in the following set of
equations (detailed derivation is shown in Section 1.1):
$$\sum_{k=1}^{p} a_k \phi_{ik} = \phi_{i0} , \quad i = 1, 2, \ldots, p \qquad (1\text{-}3)$$

where

$$\phi_{ik} = \sum_{n=0}^{N-1} s_{n-i} \, s_{n-k} . \qquad (1\text{-}4)$$

Again we have p equations in p unknowns which can be solved to
obtain the coefficients $a_k$, $1 \le k \le p$. The coefficients $\phi_{ik}$ form a
covariance matrix, hence the name "Covariance Method." Equations
such as (1-3) are known in least-squares terminology as the
normal equations of the process (Hildebrand, 1956, p. 260). In
this case we shall call (1-3) the Covariance normal equations,
or alternately the Covariance normal matrix equation.
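As an illustration of equations (1-3) and (1-4), the sketch below builds the covariance matrix from p+N samples and solves the normal equations directly. The function name and the use of a general linear solver are choices made for this example, not part of the reproduced report.

```python
import numpy as np


def covariance_lpc(signal, p):
    """Solve the Covariance normal equations (1-3) for a_1..a_p.

    `signal` holds the p+N consecutive values s_{-p}, ..., s_{N-1}, so the
    array index of s_n is n + p.
    """
    s = np.asarray(signal, dtype=float)
    N = len(s) - p
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            # phi_ik = sum over n = 0..N-1 of s_{n-i} * s_{n-k}   (1-4)
            phi[i, k] = np.dot(s[p - i: p - i + N], s[p - k: p - k + N])
    # Rows i = 1..p of (1-3): sum_k a_k phi_ik = phi_i0
    return np.linalg.solve(phi[1:, 1:], phi[1:, 0])


# A noiseless second-order recurrence is recovered exactly with p = 2.
s = [1.0, 0.5]
for _ in range(50):
    s.append(1.3 * s[-1] - 0.6 * s[-2])
a = covariance_lpc(s, p=2)
print(np.round(a, 6))
```

For a signal that obeys an exact p-th order recurrence, the prediction error can be driven to zero, so the solution returns the recurrence coefficients themselves.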

Autocorrelation  Method 


The assumptions made in this method are:

(a)  The signal is defined for all time such that it is
identically zero outside a portion of the signal N
samples long, where N is some integer.  This is
equivalent to multiplying the speech signal by a
finite window of length N.

(b)  Each sample can be approximately predicted from the
past p samples, and that

(c)  This is true for all time.

(d)  The total-squared error between the actual signal
and its predicted value is minimized for all time.

The minimization of error results in the following set of equations
(the derivation is given in Section 3.1):

$$\sum_{k=1}^{p} a_k R_{|i-k|} = R_i, \qquad 1 \le i \le p, \tag{1-5}$$

where

$$R_i = \sum_{n=0}^{N-1-|i|} s_n\, s_{n+|i|}. \tag{1-6}$$

Again (1-5) forms p equations with p unknowns to be solved for
the coefficients a_k.

The R_i are autocorrelation coefficients of the signal.  The
coefficients R_{|i-k|} form a special matrix which we shall call the
autocorrelation matrix (as opposed to the covariance matrix in
the Covariance method).  Also, we shall call equations (1-5) the
Autocorrelation normal equations or alternately the Autocorrelation
normal matrix equation.


As we shall see in Chapter IV, there are other possible formulations
for the Covariance and Autocorrelation methods.  The
assumptions made above do not all apply in the other formulations.
However, all Covariance-type formulations have (1-3) in common,
and all Autocorrelation-type formulations have (1-5) in common,
but (1-4) and (1-6) will not necessarily apply.
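
The structure of the Autocorrelation normal equations (1-5), with the coefficients defined by (1-6), can be sketched as follows. This is an illustrative example only; the names `autocorrelation` and `autocorrelation_system` are hypothetical and do not come from the report.

```python
def autocorrelation(s, p):
    """Return [R_0, ..., R_p] for the windowed signal s (zero outside 0..N-1).

    R_i = sum_{n=0}^{N-1-i} s_n s_{n+i}, as in (1-6).
    """
    N = len(s)
    return [sum(s[n] * s[n + i] for n in range(N - i)) for i in range(p + 1)]


def autocorrelation_system(s, p):
    """Build the matrix R_{|i-k|} and right-hand side R_i of (1-5), i, k = 1..p."""
    R = autocorrelation(s, p)
    matrix = [[R[abs(i - k)] for k in range(1, p + 1)] for i in range(1, p + 1)]
    rhs = R[1:p + 1]
    return matrix, rhs
```

Every diagonal of the resulting matrix is constant because its entries depend only on |i-k|. That is the special form that separates the autocorrelation matrix from a general covariance matrix.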

This concludes our brief description of each of three formulations
for linear prediction.  Now, we shall relate the work of
some researchers to these three methods.  The so-called Prony's
method (Hildebrand, 1956, p. 371) or the exponential approximation
method is equivalent to the Exact method for N = p and to the
Covariance method for N > p.  A paper by Atal and Hanauer (1971),

For the purposes of modeling speech production, we approximate
the continuously-varying vocal tract shape by a discretely-varying
vocal tract shape, i.e. a vocal tract whose shape changes
at discrete time intervals.  Such a time interval shall be called
a "frame".  Within a frame, the vocal tract shape is considered
to be fixed and can be modeled by a linear time-invariant filter.

This model of speech production has been used effectively in
speech synthesis systems.  In linear prediction the linear filter
is restricted to be all-pole.

Thus, the model of speech production used in linear prediction
consists of the following three assumptions:

(1)  Within a short interval of time (on the order of
10-25 msec) the human vocal tract is assumed to be fixed in shape.
We shall refer to such an interval as a "frame".

(2)  Within any frame, we assume that the transfer function
of the combined effects of the glottal flow, the vocal tract (including
the oral and nasal cavities) and the radiation characteristic,
can be modeled by a linear time-invariant all-pole filter with
either a sequence of impulses or white noise (or a combination
of both) as input (see Fig. 2-1).

(3)  The speech signal can be considered as the output of
such an all-pole filter whose coefficients change at discrete
intervals of time (on the order of 10 msec).

Fig. 2-1.  Discrete model of speech production employed
in linear prediction methods.  (a) Frequency-domain model.
(b) Time-domain model.

Below we shall focus our attention on a single frame where
the all-pole filter is assumed to be time-invariant.  Fig. 2-1a
shows a schematic of the model in the frequency domain.  The complex
variable z is defined by:

$$z = e^{sT} = e^{(\sigma + j\omega)T},$$

where s = σ + jω is the Laplace operator,
ω = 2πf is the radian frequency in rad/sec,
σ is the damping factor in rad/sec,
T = 1/f_s is the sampling interval in seconds,
and f_s is the sampling frequency in Hz.

Figure 2-1a is interpreted as follows: Speech is either voiced,
fricated, or both.  (Throughout this report we shall assume that
aspiration is a kind of frication.)  Voiced speech is produced
by applying a sequence of impulses, spaced at the pitch period,
to a digital filter of the form:

$$\hat{S}(z) = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{A}{H(z)}, \tag{2-2}$$

where a_k, 1 ≤ k ≤ p, are the filter coefficients,

A is a multiplicative gain factor that controls the signal
amplitude,

and

$$H(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{2-3}$$

is the inverse filter.

The output of the filter Ŝ(z) is s(nT), the speech samples.  Fricated
speech is produced by applying a sequence of white noise
samples, spaced T seconds apart, to a filter of the form Ŝ(z).
Voiced fricatives are produced by a combination of voicing and
frication.  The filter Ŝ(z) represents the combined transfer function
of the glottal flow, the vocal tract and radiation.  The poles
of the filter Ŝ(z) can be determined by solving for the roots of
the polynomial in z in the denominator of Ŝ(z).




CHAPTER  III 


LINEAR  PREDICTION  ANALYSIS 


In this chapter we shall derive in the time-domain the Covariance
and Autocorrelation normal equations (1-3) and (1-5) and
suggest algorithms for computing the predictor parameters.  Given
the normal equations, the minimum squared error is defined.  The
stability of the linear predictor, an important issue for speech
synthesis, will then be examined for the three formulations of
linear prediction.  We then take a look at some autocorrelation-domain
properties of linear prediction.  A method for the computation
of the gain factor A in Ŝ(z) will be specified.


3.1 Derivation of Covariance and Autocorrelation Normal Equations

Following the linear prediction speech production model described
in Section 2.1 and represented by (2-6), we shall assume
that a sampled speech signal s(nT) at time t = nT can be approximately
predicted by a linear weighted summation of the past p
samples.  Let this approximation to s(nT) be ŝ(nT).  We have:

$$\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}, \tag{3-1}$$

where a_k, 1 ≤ k ≤ p, is a set of real constants representing the
predictor coefficients, and p is some integer whose value is determined
as described in Sections 2.4 and 5.6.

Let the error between the actual value and the predicted value
be given by e_n, where:

$$e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}. \tag{3-2}$$


The problem is to find a_k, 1 ≤ k ≤ p, such that the error e_n is
minimized in some sense over the desired range of signal samples.

Both the Covariance and Autocorrelation methods employ a least-squares
minimization procedure since it leads to a mathematically
attractive solution.  Denote the total-squared error by E,
defined as:

$$E = \sum_{n} e_n^2 = \sum_{n} \left(s_n - \hat{s}_n\right)^2. \tag{3-3}$$


The range over which the summation in (3-3) applies and the
definition of s_n in that range is of importance.  Indeed, this is
exactly where the difference between the Covariance and Autocorrelation
methods lies.  However, let us first minimize E without
specification of the range of the summation.  Substituting (3-1)
in (3-3) we obtain:

$$E = \sum_{n} \Bigl(s_n - \sum_{k=1}^{p} a_k s_{n-k}\Bigr)^2. \tag{3-4}$$

The problem reduces to finding the condition that minimizes the
total-squared error E with respect to a_k, 1 ≤ k ≤ p.  This condition
is obtained by setting to zero the partial derivative of E with
respect to each a_i:

$$\frac{\partial E}{\partial a_i} = 0, \qquad 1 \le i \le p, \tag{3-5}$$

that is,

$$-2 \sum_{n} \Bigl(s_n - \sum_{k=1}^{p} a_k s_{n-k}\Bigr) s_{n-i} = 0, \qquad 1 \le i \le p. \tag{3-6}$$

Rearranging terms and interchanging summations we obtain:

$$\sum_{k=1}^{p} a_k \sum_{n} s_{n-k}\, s_{n-i} = \sum_{n} s_n\, s_{n-i}, \qquad 1 \le i \le p. \tag{3-7}$$


Equations (3-7) are known as the normal equations.  For any
definition of the signal s_n, (3-7) forms a set of p equations with p
unknowns which can be solved for the predictor coefficients a_k.
Now, we shall derive the Covariance and Autocorrelation normal
equations from (3-7).


Covariance  Normal  Equations 

Referring back to the assumptions of the Covariance method
in Section 1.2, the summation over n in (3-3) and hence in (3-7)
must go over N consecutive signal samples.  Without loss of
generality, we let the range of summation over n be: n = 0, 1, ..., N-1.

We can now write (3-7) as:

$$\sum_{k=1}^{p} a_k \varphi_{ik} = \varphi_{i0}, \qquad i = 1, 2, \ldots, p, \tag{3-8}$$

where

$$\varphi_{ik} = \sum_{n=0}^{N-1} s_{n-i}\, s_{n-k}. \tag{3-9}$$

Note that (3-8) and (3-9) are identical to (1-3) and (1-4), and
the derivation of the Covariance normal equations is complete.
From (3-8) and (3-9) we note that values of s_n for n = -p, ..., -1,
0, 1, ..., N-1, must be known.  Therefore the signal s_n must be
defined for p+N consecutive values, as stated in Section 1.2.

Autocorrelation  Normal  Equations 

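From the assumptions in Section 1.2 we can define the signal
s_n as follows:

$$s_n = \begin{cases} \text{some sampled signal}, & n = 0, 1, \ldots, N-1, \\ 0, & \text{otherwise}. \end{cases} \tag{3-10}$$

The windowed signal s_n is defined for all n, -∞ < n < ∞.  Equation
(3-7) becomes:

$$\sum_{k=1}^{p} a_k \sum_{n=-\infty}^{\infty} s_{n-k}\, s_{n-i} = \sum_{n=-\infty}^{\infty} s_n\, s_{n-i}, \qquad 1 \le i \le p. \tag{3-11}$$

NOTES

This is the method used in the present study.  (See Sec. III)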


Substituting n' = n-i in (3-11) we obtain:

$$\sum_{k=1}^{p} a_k \sum_{n'=-\infty}^{\infty} s_{n'}\, s_{n'+i-k} = \sum_{n'=-\infty}^{\infty} s_{n'}\, s_{n'+i}, \qquad 1 \le i \le p. \tag{3-12}$$


By definition, the autocorrelation function R_i of the signal s_n
is given by

$$R_i = \sum_{n=-\infty}^{\infty} s_n\, s_{n+|i|}, \tag{3-13}$$

and

$$R_{-i} = R_i. \tag{3-14}$$

Therefore, (3-12) reduces to:

$$\sum_{k=1}^{p} a_k R_{|i-k|} = R_i, \qquad 1 \le i \le p. \tag{3-15}$$


Now, since s_n is defined in (3-10) to be identically zero for
n < 0 and n ≥ N, (3-13) reduces to:

$$R_i = \sum_{n=0}^{N-1-|i|} s_n\, s_{n+|i|}. \tag{3-16}$$

Equations (3-15) and (3-16) are identical to (1-5) and (1-6),
and the derivation of the Autocorrelation normal equations is
complete.

3.2 Computation of Predictor Parameters

In each of the three formulations of linear prediction presented
in Section 1.2 (eqs. 1-2, 3-8, 3-15), the predictor coefficients
a_k, 1 ≤ k ≤ p, can be computed by solving a set of p equations
with p unknowns.  There exist several standard methods for
performing the necessary computations, e.g. the Gauss reduction
or elimination method and the Crout reduction method (Hildebrand,
1956, pp. 428-434).  These methods are general and can be used
with the Exact, Covariance and Autocorrelation formulations.  However,
we note from the Covariance and Autocorrelation normal equations
(3-8) and (3-15) that the matrix of coefficients in each
case is a covariance matrix.  The coefficients φ_ik in (3-8) form
a typical covariance matrix and the coefficients R_{|i-k|} in (3-15)
form a special type of covariance matrix known as an autocorrelation
matrix.  A covariance matrix is symmetric and in general
positive semidefinite, but in practice these covariance matrices
are usually positive definite.  Therefore, (3-8) and (3-15) can
be solved more efficiently by the square-root method (Kunz, 1957,
pp. 222-225).  This method also requires about half the storage
of the general methods.  A similar method that does not employ
the square root operation has been reported by Wilkinson and
Reinsch (1971, pp. 9-30).  Further reduction in storage and computation
time is possible in solving the Autocorrelation normal
equations because of their special form.
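
The square-root method mentioned above is commonly known today as the Cholesky decomposition. The sketch below is an illustration under that assumption and is not taken from Kunz or from the report; the function name `cholesky_solve` is hypothetical. It factors a symmetric positive definite matrix M as L·Lᵀ and then solves two triangular systems.

```python
import math


def cholesky_solve(M, b):
    """Solve M x = b for a symmetric positive definite matrix M.

    Only the lower triangle of M is read, which reflects the storage saving
    the text attributes to the square-root method.
    """
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]

    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]

    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
```

As a usage example, the windowed signal 1, 2, 3 with p = 2 gives R_0 = 14, R_1 = 8, R_2 = 3, so (3-15) becomes a 2-by-2 system whose solution is a_1 = 2/3, a_2 = -1/6.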

3.3 Minimum Total-Squared Error

The predictor coefficients a_k are determined such that the
total-squared error E in (3-4) is minimized.  After computation
of the coefficients a_k using one of the methods mentioned in
Section 3.2, one should be able to compute the minimum total-squared
error by substituting for the computed coefficients
a_k in (3-4).  (Note that there is no error criterion associated
with the Exact method.)  Thus:

$$E_p = \sum_{n} \Bigl(s_n - \sum_{k=1}^{p} a_k s_{n-k}\Bigr)^2. \tag{3-17}$$

Substituting (3-7), the condition for the minimization of E, and
collecting terms, we obtain the minimum total-squared error E_p:

$$E_p = \sum_{n} s_n^2 - \sum_{k=1}^{p} a_k \sum_{n} s_n\, s_{n-k}. \tag{3-18}$$

In particular, for the Covariance method, n ranges from 0 to N-1.
Thus, substituting (3-9) in (3-18) we obtain the minimum total-squared
error in the Covariance method:

$$E_p = \varphi_{00} - \sum_{k=1}^{p} a_k \varphi_{0k}. \qquad \text{(Covariance Method)} \tag{3-19}$$

In the Autocorrelation method n ranges from -∞ to ∞.  Substituting
(3-13) in (3-18) we have:

$$E_p = R_0 - \sum_{k=1}^{p} a_k R_k. \qquad \text{(Autocorrelation Method)} \tag{3-20}$$
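
As a small worked case, added here for illustration and not part of the original report, take p = 1 in the Autocorrelation method. The normal equations (3-15) then reduce to a single equation, and (3-20) gives the minimum error directly:

$$a_1 R_0 = R_1 \;\Longrightarrow\; a_1 = \frac{R_1}{R_0}, \qquad E_1 = R_0 - a_1 R_1 = R_0\left(1 - \frac{R_1^2}{R_0^2}\right).$$

Since |R_1| ≤ R_0 for any autocorrelation function, this confirms that 0 ≤ E_1 ≤ R_0 in the first-order case.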

We shall have the chance in Chapter V to discuss the behavior
of this minimum error in the Autocorrelation method as a
function of p and the autocorrelation function.  In particular,
we shall be interested in the normalized error V_p defined by:

$$V_p = \frac{E_p}{R_0} = \frac{\text{energy in the predictor error samples}}{\text{energy in the speech signal}} \tag{3-21}$$

$$V_p = 1 - \sum_{k=1}^{p} a_k r_k, \tag{3-22a}$$

where

$$r_k = \frac{R_k}{R_0}, \qquad \text{for all } k, \tag{3-22b}$$

and the samples r_k will be known as the normalized autocorrelation
function.  (Levinson (1947) uses the notation V, Markel (SCRL Mon.,
1971) uses n, and Atal and Hanauer (1971) use t for the normalized
error.  We have chosen the letter V because of the possible usefulness
of the normalized error in the indication of voicing.)
Note that dividing (3-15) by R_0 and using (3-22b) we obtain:

$$\sum_{k=1}^{p} a_k r_{|i-k|} = r_i, \qquad 1 \le i \le p. \tag{3-23}$$

Equation (3-23) says that the predictor coefficients can also be
computed using the normalized autocorrelation samples r_k.  From
(3-22b) and the fact that r_k is an autocorrelation function we
have:

$$r_0 = 1 \quad \text{and} \quad |r_k| \le 1, \ \text{for all } k. \tag{3-24}$$

The signal total energy R_0 can vary widely for different signals,
which might cause round-off problems in trying to solve (3-15) in
a digital computer with only integer arithmetic capability.

For such cases it would be useful to normalize the autocorrelation
coefficients first by using (3-22b), and then solve for the
a_k's using (3-23).
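
The normalization in (3-22b) and the normalized error in (3-22a) can be sketched as below. This is an illustrative example only; the function name `normalized_error` and the list layout (R holds R_0..R_p, a holds a_1..a_p) are assumptions.

```python
def normalized_error(R, a):
    """Return V_p = 1 - sum_{k=1}^{p} a_k r_k, with r_k = R_k / R_0.

    R is [R_0, ..., R_p]; a is [a_1, ..., a_p] and is assumed to already
    solve the Autocorrelation normal equations for this R.
    """
    if R[0] <= 0:
        raise ValueError("R_0 must be positive (signal has no energy)")
    r = [Rk / R[0] for Rk in R]  # normalized autocorrelation, r[0] == 1
    return 1.0 - sum(a[k - 1] * r[k] for k in range(1, len(a) + 1))
```

For R = [14, 8, 3] and a = [2/3, -1/6], the result is 55/84, which is E_p / R_0 with E_p = 55/6 from (3-20).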

CHAPTER IV

SPECTRAL ESTIMATION AND ANALYSIS-BY-SYNTHESIS

In Chapter III the Covariance and Autocorrelation methods
of linear prediction were derived from a time-domain formulation.
In this chapter we shall show that the same normal equations can
be derived from a frequency-domain formulation.  It will become
clear that linear prediction can be considered equally validly
as either a time-domain or a frequency-domain type of analysis.

First, the Autocorrelation method is reinterpreted in terms
of an inverse filter formulation.  This leads directly to linear
prediction analysis in the frequency domain.  The Autocorrelation
method is rederived from the spectral domain by approximating
the signal short-time spectrum P(ω) by an all-pole power spectrum
P̂(ω).  An error criterion between the two spectra is defined and
minimized.  The results are interpreted in terms of traditional
methods of spectral analysis-by-synthesis.  The Autocorrelation
method is then reformulated in terms of a direct and an indirect
method by relating to the corresponding methods of estimation of
power spectra.  An analogous reformulation of the Covariance
method is derived from a generalized method of analysis-by-synthesis
where the signal is assumed to be nonstationary and the
two-dimensional short-time power spectrum Q(ω,ω') is to be
approximated by an all-pole two-dimensional spectrum Q̂(ω,ω').

4.1  Inverse Filter Formulation

The linear prediction error e_n was defined by (3-2), and
is repeated here for convenience:

$$e_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}. \tag{3-2}$$

Since the signal s_n is defined for all time, then e_n is also
defined for all time.  Therefore, we can take the z-transform of
(3-2) by multiplying both sides of the equation by z^{-n} and summing
over all n.  The result is:

$$E(z) = S(z)\Bigl[1 - \sum_{k=1}^{p} a_k z^{-k}\Bigr] = S(z)\, H(z), \tag{4-1}$$

where E(z) and S(z) are the z-transforms of e_n and s_n, respectively,
and H(z) = 1 - Σ a_k z^{-k} was already defined in (2-3) as
the inverse filter.

From (4-1), the error signal e_n can be interpreted as the output
of a filter H(z) whose input is s_n, as shown in Fig. 4-1.

Fig. 4-1.  The error sequence e_n as the output of an inverse
filter H(z).


Therefore, another way to view the error minimization problem in
Section 3.1 is to solve for the parameters a_k of the inverse filter
H(z) which will minimize the energy in the output error
signal, for a given value of p.  This is what Markel calls the
inverse filter formulation (Markel, 1972).
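
The inverse filter view can be sketched in the time domain: passing the windowed signal through H(z) yields the error sequence, whose energy is the quantity being minimized. The function name `inverse_filter` is hypothetical, and the signal is assumed to be zero outside its N samples, as in the Autocorrelation method.

```python
def inverse_filter(s, a):
    """Return e_n = s_n - sum_{k=1}^{p} a_k s_{n-k} for n = 0..N+p-1.

    s is the windowed signal (treated as zero outside 0..N-1) and
    a is [a_1, ..., a_p].  The output has N + p samples because the
    filter keeps responding for p samples after the signal ends.
    """
    N, p = len(s), len(a)
    e = []
    for n in range(N + p):
        current = s[n] if n < N else 0.0
        predicted = sum(a[k - 1] * s[n - k]
                        for k in range(1, p + 1) if 0 <= n - k < N)
        e.append(current - predicted)
    return e
```

With s = [1, 2, 3] and the coefficients a = [2/3, -1/6] that solve the Autocorrelation normal equations for this signal, the error energy is 55/6, matching R_0 minus the sum of a_k R_k.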
Equation (4-1) can be solved for S(z) to obtain:

$$S(z) = \frac{E(z)}{H(z)} = \frac{E(z)}{1 - \sum_{k=1}^{p} a_k z^{-k}}. \tag{4-2}$$


(4-2) is an exact equation.  According to the speech production
model described in Section 2.1, if the signal s_n is the vocal
tract response due to a single pitch pulse, then the transfer
function S(z) can be approximated by an all-pole filter Ŝ(z) given
by (2-2) and shown below:

$$\hat{S}(z) = \frac{A}{H(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}. \tag{2-2}$$


Comparing (2-2) and (4-2) we conclude that E(z) is approximated
by another function

$$\hat{E}(z) = A,$$

which corresponds to a time-domain approximation ê_n given by:

$$\hat{e}_n = A\, \delta_{n0}, \tag{4-3}$$

where δ_{n0} is the Kronecker delta defined by (3-24).

ê_n is just an impulse of magnitude A.  Now, in order to conserve
energy between ê_n and e_n we must have

$$\sum_{n} \hat{e}_n^2 = \sum_{n} e_n^2. \tag{4-4}$$


After the minimization of the total-squared error, the right-hand
side of (4-4) is equal to the minimum total-squared error
E_p given by (3-20).  The left-hand side of (4-4) is determined
easily from (4-3), and we have:

$$A^2 = E_p = R_0 - \sum_{k=1}^{p} a_k R_k.$$

The result is identical to (3-37) which was derived by energy
conservation between the signal s_n and the impulse response of
Ŝ(z).

The above analysis assumed that the vocal tract was excited
by a single pulse.  The same results would be obtained if one
assumed a white noise source excitation.
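
The gain result above, A² = E_p, can be computed directly from the autocorrelation coefficients and the predictor coefficients. The sketch below is illustrative; the function name `lpc_gain` and the list layout are assumptions.

```python
import math


def lpc_gain(R, a):
    """Return the gain A with A^2 = R_0 - sum_{k=1}^{p} a_k R_k.

    R is [R_0, ..., R_p]; a is [a_1, ..., a_p], assumed to solve the
    Autocorrelation normal equations for this R.
    """
    energy = R[0] - sum(a[k - 1] * R[k] for k in range(1, len(a) + 1))
    if energy < 0.0:
        raise ValueError("negative error energy; a does not match R")
    return math.sqrt(energy)
```

For R = [14, 8, 3] and a = [2/3, -1/6] this gives A² = 55/6.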

4.2 Error Minimization in the Spectral Domain

In this section we shall show that the Autocorrelation normal
equations (3-15) can also be derived completely in the frequency
domain.  Before we proceed, we shall define the power spectrum
of a transfer function Y(z) as the magnitude squared of Y(z)
evaluated on the unit circle, i.e. z = e^{jωT}.  Y(z) evaluated at
z = e^{jωT} will be denoted by Y(ω), so that the power spectrum is
given by:

$$\text{Power Spectrum} = Y(\omega)\, \overline{Y(\omega)} = |Y(\omega)|^2, \tag{4-5}$$

where the over-bar denotes complex conjugate.

Let the power spectrum of Ŝ(z) be denoted by P̂(ω), and of S(z) by
P(ω), then:

$$\hat{P}(\omega) = |\hat{S}(\omega)|^2 \tag{4-6a}$$

and

$$P(\omega) = |S(\omega)|^2. \tag{4-6b}$$

We shall call P̂(ω) the linear prediction or approximate spectrum
and P(ω) the actual or signal spectrum.  Methods for computing
P̂(ω) and P(ω) are given in Appendix C.

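Making use of Parseval's theorem (see Appendix A), the
total-squared error E can be represented by:

$$E = \sum_{n=-\infty}^{\infty} e_n^2 = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} |E(\omega)|^2 \, d\omega = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P_e(\omega)\, d\omega, \tag{4-7}$$

where P_e(ω) is the error power spectrum.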

From linear system theory, we have from Fig. 4-1:

$$P_e(\omega) = P(\omega)\, |H(\omega)|^2, \tag{4-8}$$

where H(ω) is equal to H(z) evaluated for z = e^{jωT}.

Substituting (4-8) in (4-7) we have:

$$E = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P(\omega)\, H(\omega)\, \overline{H(\omega)}\, d\omega. \tag{4-9}$$


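Following the same procedure in Section 3.1, E is minimized by
setting ∂E/∂a_i = 0, 1 ≤ i ≤ p:

$$\frac{\partial E}{\partial a_i} = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P(\omega)\, \frac{\partial}{\partial a_i} \Bigl[ H(\omega)\, \overline{H(\omega)} \Bigr] d\omega = 0,$$

or

$$-\frac{T}{\pi} \int_{-\pi/T}^{\pi/T} P(\omega) \Bigl[ \cos(i\omega T) - \sum_{k=1}^{p} a_k \cos\bigl((i-k)\omega T\bigr) \Bigr] d\omega = 0.$$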


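Interchanging integration and summation we have:

$$\sum_{k=1}^{p} a_k\, \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P(\omega) \cos\bigl((i-k)\omega T\bigr)\, d\omega = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P(\omega) \cos(i\omega T)\, d\omega, \qquad 1 \le i \le p. \tag{4-10}$$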


We know that the autocorrelation function R_i is defined as
the inverse Fourier transform of the power spectrum, i.e.

$$R_i = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P(\omega)\, e^{j i \omega T}\, d\omega, \tag{4-11a}$$

or

$$R_i = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} P(\omega) \cos(i\omega T)\, d\omega. \tag{4-11b}$$

NOTES

There are different representations of the error E which are
again used in section VI, with modifications to define measures
of distortion.


(4-11b) follows from (4-11a) because the power spectrum is a
real and even function of frequency.  Substituting (4-11b) in
(4-10) and noting that R_{-i} = R_i, we have:

$$\sum_{k=1}^{p} a_k R_{|i-k|} = R_i, \qquad 1 \le i \le p, \tag{4-12}$$

which are the same Autocorrelation normal equations as (3-15).

The minimum total-squared error E_p can be obtained by using
(4-10) and (4-11) in (4-9).  The answer can be shown to be equal
to

$$E_p = R_0 - \sum_{k=1}^{p} a_k R_k, \tag{4-13}$$

which is identical to that given in (3-20) and (3-37).

The above derivation shows that, in the Autocorrelation
method, the predictor parameters a_k can be determined if only
the signal power spectrum is known.  In fact all that is needed
are the first p coefficients of the autocorrelation function,
which can be computed either from the time signal (Section 3.1)
or from the power spectrum as was shown above.  The latter statement
will be the basis for other formulations of the Autocorrelation
method which are based on the idea of estimating the first
p values of the autocorrelation function (see Section 4.4).
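
The equivalence of the two routes to the autocorrelation coefficients can be checked numerically. The sketch below is an illustration, not the report's procedure: it replaces the integral in (4-11b) with a discrete Fourier transform on M = 2N frequency samples, which is an assumption chosen so that lags up to p < N are free of wrap-around. The function names are hypothetical.

```python
import cmath
import math


def autocorr_time(s, p):
    """R_i from the time signal: sum_{n=0}^{N-1-i} s_n s_{n+i}, i = 0..p."""
    N = len(s)
    return [sum(s[n] * s[n + i] for n in range(N - i)) for i in range(p + 1)]


def autocorr_from_spectrum(s, p):
    """R_i from the power spectrum, using a discrete form of (4-11b)."""
    N = len(s)
    if p >= N:
        raise ValueError("this sketch assumes p < N")
    M = 2 * N  # number of frequency samples; zero padding avoids wrap-around
    power = []
    for m in range(M):
        S = sum(s[n] * cmath.exp(-2j * math.pi * m * n / M) for n in range(N))
        power.append(abs(S) ** 2)  # P(w) = |S(w)|^2 on the sampled unit circle
    # The power spectrum is real and even, so only the cosine part survives.
    return [sum(power[m] * math.cos(2 * math.pi * m * i / M)
                for m in range(M)) / M
            for i in range(p + 1)]
```

Both functions should agree to within rounding error, for example giving 14, 8, 3 for the signal 1, 2, 3 with p = 2.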

4.3 The Spectral Envelope and Analysis-by-Synthesis

We shall now interpret the minimization of error in the
Autocorrelation method in terms of the estimation of the spectral
envelope and in terms of analysis-by-synthesis.

From (2-2), H(z) can be written as:

$$H(z) = \frac{A}{\hat{S}(z)},$$

and

$$H(\omega) = \frac{A}{\hat{S}(\omega)}. \tag{4-14}$$

Substituting (4-14) in (4-9) we obtain:

$$E = \frac{A^2\, T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{|\hat{S}(\omega)|^2}\, d\omega. \tag{4-15}$$

|Ŝ(ω)|² is the approximate power spectrum P̂(ω) as defined in (4-6a),
and (4-15) reduces to:

$$E = \frac{A^2\, T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}(\omega)}\, d\omega. \tag{4-16}$$


Therefore, minimizing the total-squared error E is equivalent
to the minimization of the integrated ratio of the signal power
spectrum P(ω) to its approximation P̂(ω).  Another way to look at
this is that if one is interested in approximating a power spectrum
P(ω) by an all-pole spectrum P̂(ω) then (4-16) is an error
measure that can be used in optimizing the approximation.
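
The error measure (4-16) can be evaluated numerically by sampling the unit circle. The sketch below is illustrative only: the integral is replaced by an average over M equally spaced frequencies, an assumption that is exact for a finite signal when M is at least N + p. The function name `spectral_ratio_error` and its parameters are hypothetical.

```python
import cmath
import math


def spectral_ratio_error(s, a, gain, M=64):
    """Discrete form of (4-16): A^2 times the average of P(w) / P_hat(w).

    s is the windowed signal, a is [a_1, ..., a_p], gain is A.
    P_hat(w) = A^2 / |H(w)|^2 is the all-pole model spectrum; H(w) is
    assumed to have no zeros on the unit circle.
    """
    total = 0.0
    for m in range(M):
        wT = 2.0 * math.pi * m / M
        S = sum(s[n] * cmath.exp(-1j * wT * n) for n in range(len(s)))
        H = 1.0 - sum(a[k - 1] * cmath.exp(-1j * wT * k)
                      for k in range(1, len(a) + 1))
        P = abs(S) ** 2
        P_hat = gain ** 2 / abs(H) ** 2
        total += P / P_hat
    return gain ** 2 * total / M
```

For the signal 1, 2, 3 with a = [2/3, -1/6], the result equals the time-domain error energy 55/6 regardless of the gain value supplied, since the gain cancels between A² and P̂(ω).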


6.21  Adequacy  of  the  All-Pole  Model 

This issue has already been discussed in Section 2.1.  We
have argued there that the all-pole model seems to be quite adequate
for speech synthesis.  The question here is the adequacy of
the model for formant extraction.  For the purposes of speech
recognition, for example, one would ideally want to be able to compute
the transfer function of the vocal tract.  This means that
the antiformants as well as the formants may be needed.  It is
reasonable to assume that the all-pole model would be adequate for
formant extraction of vowels.  (This assumption is based on another
assumption, namely that the glottal spectrum and radiation can be
approximated by poles only.)  However, for sounds such as nasals
and fricatives, whose spectra are known to have antiformants, the
all-pole model might not yield accurate results for the resonances
of the vocal tract.  Figure 6-3 shows the signal spectrum and the
linear prediction spectrum (p=14) for the second [n] in the word
"anyone" for a male speaker.  The problem in looking at a spectrum
like this is in deciding where the formants and antiformants are.
There is no good way of making this decision in general, unless
one has some knowledge about the system that produced the signal
whose spectrum is under analysis.  In fact, the spectral fit in
Fig. 6-3 is very adequate, and it is quite reasonable to assume
that some all-pole system has those characteristics.  However,
from our knowledge of the acoustics of the human speech production
system, we know that if the spectrum in Fig. 6-3 is that of the
sound [n], it must have zeros as well as poles.  But even if we
knew this, how would the linear prediction all-pole approximation
help us in determining the values of the formants and antiformants?
Some of the poles will correspond approximately to nasal formants,
which can be obtained as described earlier in this section, but we
know of no simple manner in which the antiformants can be determined
from the poles of the linear prediction spectrum.  The problem is
that the same poles must approximate the effects of both the
formants and the antiformants.

NOTES

This represents the only potential insurmountable limitation to
the use of the LPC technique for determining the performance
of a voice channel.

Fig. 6-3.  Signal spectrum and linear prediction spectrum
(p=14) for the second [n] in the word "anyone".
Period of analysis is 15 msec.


CHAPTER VII


CONCLUSIONS



Linear prediction is an autocorrelation-domain analysis.
Therefore, it can be approached either from the time or frequency
domain.  Although the actual computations are performed in the
time domain, we chose to derive the most general formulations for
linear prediction from the frequency domain because of the dominance
of spectral analysis in speech research.  We have shown
that all least-squares methods of linear prediction can be derived
from a single general concept, namely that of generalized analysis-by-synthesis.
Here the 2D-spectrum (two-dimensional spectrum) of
a nonstationary signal (such as speech) is to be approximated by
another 2D-spectrum, where the error to be minimized is proportional
to the integral of the ratio of the signal spectrum to the approximate
spectrum.  This error criterion was shown to be very desirable
for a good spectral envelope fit.  In the special case when the
approximate spectrum consists of poles only, the generalized
method reduces to the general Covariance method of linear prediction.
If, in addition, the signal is assumed to be stationary,
the 2D-spectrum is replaced by the ordinary power spectrum, and
the Covariance method reduces to the Autocorrelation method of
linear prediction.

NOTES

Even though the technique is implemented in the time domain,
frequency domain interpretations can also be obtained.  However,
it should be noted that it is more difficult if not impossible
to obtain time domain interpretations from the usual spectral
analyses that form a vast contribution to the speech research
literature.
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